Getting started

suppressPackageStartupMessages(library(gapminder))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(plotly))
suppressPackageStartupMessages(library(maps))
library(extrafont)
## Registering fonts with R

Part 1: Factor management

Am I working with factors?

is.factor(gapminder$continent)
## [1] TRUE
is.factor(gapminder$country)
## [1] TRUE
glimpse(gapminder)
## Observations: 1,704
## Variables: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...

Yes, both continent and country are factors. Using glimpse(), I can confirm that they are the only two factors in the gapminder tibble, which contains 1,704 observations of six variables.

Drop factors and levels: Oceania

First, I’ll remove the continent Oceania. It only contains two countries, which makes it a less interesting comparison than the other continents. I’m going to skip some piping here, even though it will make the code slightly longer, because I want to separate the data manipulation from the sanity checks.

# remove Oceania
drop_ocea <- gapminder %>% 
  filter(continent != "Oceania")

# check the number of rows
glimpse(drop_ocea)
## Observations: 1,680
## Variables: 6
## $ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, ...
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia...
## $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992...
## $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.8...
## $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 1488...
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 78...
# check levels of continent factor
unique(drop_ocea$continent)
## [1] Asia     Europe   Africa   Americas
## Levels: Africa Americas Asia Europe Oceania

From glimpse(), I can see that the tibble now contains only 1,680 observations, so rows have been removed. From unique(), I can see that it is the Oceania entries that are gone from the data, but Oceania remains a level of the factor.

# remove Oceania factor level
no_ocea <- drop_ocea %>%
  mutate(continent = fct_drop(continent))

# check if Oceania remains as a factor level
is.factor(no_ocea$continent)
## [1] TRUE
unique(no_ocea$continent)
## [1] Asia     Europe   Africa   Americas
## Levels: Africa Americas Asia Europe

Great, looks like continent is still a factor and now Oceania has been removed as a level. For my check, I decided against using str() because it provides too much information when, as in this case, I have fairly targeted questions about the data.
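
For targeted checks like these, levels() and nlevels() answer the question a little more directly than unique(). A minimal sketch on a toy factor (the values are made up for illustration, and base R's droplevels() stands in for fct_drop(), which behaves the same way in this case):

```r
# Toy factor standing in for gapminder$continent (illustrative values only)
continent <- factor(c("Asia", "Europe", "Oceania", "Asia"))

# Subsetting removes the Oceania entries but keeps the unused level
kept <- continent[continent != "Oceania"]
levels(kept)
nlevels(kept)

# Dropping unused levels removes Oceania from the level set
dropped <- droplevels(kept)
levels(dropped)
nlevels(dropped)
```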

Reorder levels based on the data

As we can see from unique() above, the levels are ordered alphabetically, which is pretty arbitrary. I’m going to reorder the continents by the standard deviation of population and pipe this into a violin plot.

no_ocea %>% 
  mutate(continent = fct_reorder(continent, pop, .fun = sd)) %>% 
  ggplot(aes(continent, pop, fill = continent)) +
  scale_y_log10() + 
  geom_violin() +
  labs(title = "Country-level population by continent, 1952 - 2007",
       subtitle = "Continents ordered from lowest to highest population standard deviation",
       x = "Continent",
       y = "Population")

We can also create a graph ordered by minimum population.

no_ocea %>% 
  mutate(continent = fct_reorder(continent, pop, .fun = min)) %>% 
  ggplot(aes(continent, pop, fill = continent)) +
  scale_y_log10() + 
  geom_violin() +
  labs(title = "Country-level population by continent, 1952 - 2007",
       subtitle = "Continents ordered from lowest to highest minimum population",
       x = "Continent",
       y = "Population")

This demonstrates something about the way ggplot2 assigns fill colors: they’re linked to the order of the factor levels (e.g. the first level is always pink under the default palette), not to anything inherent in the data.
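
If the colors should stay attached to the continents regardless of level order, one option is a manual scale with a named vector. The sketch below uses continent_colors, the named color vector that ships with the gapminder package; the object name p_fixed is just a placeholder for this example:

```r
library(gapminder)  # also provides continent_colors, a named vector of hex colors
library(dplyr)
library(forcats)
library(ggplot2)

p_fixed <- gapminder %>%
  filter(continent != "Oceania") %>%
  mutate(continent = fct_drop(continent),
         continent = fct_reorder(continent, pop, .fun = min)) %>%
  ggplot(aes(continent, pop, fill = continent)) +
  scale_y_log10() +
  geom_violin() +
  # values are matched to levels by name, so reordering no longer changes colors
  scale_fill_manual(values = continent_colors)

p_fixed
```

Because the vector is named, each continent keeps its color whichever summary function is used for the ordering.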

Part 2: File I/O

Export data to .csv

Before I write the data frame to a csv, let’s filter to get a more manageable data set to work with. I’m going to filter to the Americas only, keeping the years ending in 7 so the data are at 10-year intervals instead of 5.

amer_7 <- gapminder %>% 
  filter(continent == "Americas") %>% 
  filter(str_detect(year, "7$"))
  
amer_7 %>% 
  ggplot(aes(country, lifeExp)) +
  geom_point() +
  coord_flip() + #flip axes
  labs(title = "Life expectancy in the Americas, 1957-2007",
       x = "Country",          # labs() names the aesthetics; coord_flip() only changes where they are drawn
       y = "Life expectancy") +
  theme_light()

When I plot the filtered data, it displays alphabetically – not super helpful for understanding trends in the data.

I’m going to order the country data by median life expectancy to get a better idea of overall trends.

amer_7_ord <- amer_7 %>%   
  mutate(country = fct_reorder(country, lifeExp, .fun = median))  # reorder country by median lifeExp

amer_7_ord %>% 
  ggplot(aes(country, lifeExp)) +
  geom_point() +
  coord_flip() +
  labs(title = "Life expectancy in the Americas, 1957-2007",
       x = "Country",          # x aesthetic is country, drawn vertically after coord_flip()
       y = "Life expectancy") +
  theme_light()

This plot is way more helpful!

Now, let’s experiment with exporting the data frame I’ve created to a .csv. Will the ordering be preserved if I re-import and plot it?

write_csv(amer_7_ord, "amer_7_ord.csv") 

I’m going to re-import the same .csv and plot it in the same way to see if it retains the ordering.

read_csv("amer_7_ord.csv") %>%  #import .csv
  ggplot(aes(country, lifeExp)) +
  geom_point() +
  coord_flip() +
  labs(title = "Life expectancy in the Americas, 1957-2007",
       x = "Country",          # x aesthetic is country, drawn vertically after coord_flip()
       y = "Life expectancy") +
  theme_light()
## Parsed with column specification:
## cols(
##   country = col_character(),
##   continent = col_character(),
##   year = col_integer(),
##   lifeExp = col_double(),
##   pop = col_integer(),
##   gdpPercap = col_double()
## )

Nope, the ordering is not preserved by the .csv.
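
The ordering is lost because a .csv stores only the labels as text. The level order lives in the factor object, and read_csv() returns country as a character column (see the column specification above). One way to get the plot order back is to re-apply fct_reorder() after import. A minimal sketch on a toy data frame (the values are made up) written to a temporary file:

```r
library(readr)
library(forcats)

# Toy data standing in for amer_7_ord (illustrative values only)
toy <- data.frame(
  country = factor(c("Haiti", "Canada", "Cuba")),
  lifeExp = c(60, 80, 78)
)
toy$country <- fct_reorder(toy$country, toy$lifeExp)
levels(toy$country)  # ordered by lifeExp, not alphabetically

# Round trip through a .csv in a temporary location
csv_path <- tempfile(fileext = ".csv")
write_csv(toy, csv_path)
back <- read_csv(csv_path,
                 col_types = cols(country = col_character(),
                                  lifeExp = col_double()))
is.factor(back$country)  # the factor, and with it the level order, is gone

# Rebuild the factor and its data-driven order after import
back$country <- fct_reorder(factor(back$country), back$lifeExp)
identical(levels(back$country), levels(toy$country))
```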

Export data to RDS

Let’s try using saveRDS() and readRDS(), and use identical() to test whether the object I saved and the object I read back are the same. If you want to run this segment of the code at home, you’ll need to specify a different file directory for the output.

saveRDS(amer_7_ord, '/Users/miellemichaux/Documents/STAT54X/hw05/amer_7_ord.rds') 

amer_7_ordRDS <- readRDS('/Users/miellemichaux/Documents/STAT54X/hw05/amer_7_ord.rds') 

identical(amer_7_ord, amer_7_ordRDS) 
## [1] TRUE

Yes, the objects appear to be the same, but I’ll plot the imported RDS just to be sure.

amer_7_ordRDS  %>% 
  ggplot(aes(country, lifeExp)) +
  geom_point() +
  coord_flip() +
  labs(title = "Life expectancy in the Americas, 1957-2007",
       x = "Country",          # x aesthetic is country, drawn vertically after coord_flip()
       y = "Life expectancy") +
  theme_light()

To summarize: RDS exports and imports preserve the factor order, but writing to a .csv does not.
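
RDS serializes the R object itself, so the levels attribute comes along with the values. Here is a sketch of the same round trip using a temporary file rather than an absolute path, which should run on any machine (toy factor, illustrative values):

```r
# Factor with a deliberately non-alphabetical level order
country <- factor(c("Haiti", "Canada", "Cuba"),
                  levels = c("Haiti", "Cuba", "Canada"))

# Save to and restore from a temporary location
rds_path <- file.path(tempdir(), "country.rds")
saveRDS(country, rds_path)
country_rds <- readRDS(rds_path)

identical(country, country_rds)
levels(country_rds)
```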

Part 3: Visualization design

Remake at least one figure or create a new one, in light of something you learned in the recent class meetings about visualization design and color. Maybe juxtapose your first attempt and what you obtained after some time spent working on it. Reflect on the differences. If using Gapminder, you can use the country or continent color scheme that ships with Gapminder. Consult the dimensions listed in All the Graph Things.

http://stat545.com/graph00_index.html

Then, make a new graph by converting this visual (or another, if you’d like) to a plotly graph. What are some things that plotly makes possible, that are not possible with a regular ggplot2 graph?

Remake a figure using data viz principles

amer_lifeExp_plot <- amer_7_ord %>% 
  ggplot(aes(lifeExp, country)) +
  geom_point(color = "darkcyan") +
  coord_flip() +
  labs(title = "Life expectancy in the Americas, 1957-2007",
       x = "Life expectancy",
       y = "Country") +
  theme_light() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, color = "grey50")) +
  theme(axis.text.y = element_text(color = "grey50")) +
  theme(panel.grid.major.x= element_blank(), 
        panel.grid.minor.y = element_blank(),
        panel.border = element_blank(),
        axis.ticks = element_blank()) +
  theme(text=element_text(size = 10, family = "Arial", color = "grey30"))

amer_lifeExp_plot 

Adding plotly to the mix

I’m going to use plotly syntax to recreate the original graph.

amer_7_ord %>% 
  plot_ly(x = ~country, 
        y = ~lifeExp, 
        type = "scatter",
        mode = "markers",
        opacity = 0.2) %>% 
    layout(xaxis = list())

Plotly adds an empty Afghanistan entry here, perhaps because the country factor still carries unused levels from the full dataset.

Let’s try using fct_drop() to remove empty factor levels.

americas <- amer_7_ord %>% 
  mutate(country = fct_drop(country))

unique(americas$country)
##  [1] Argentina           Bolivia             Brazil             
##  [4] Canada              Chile               Colombia           
##  [7] Costa Rica          Cuba                Dominican Republic 
## [10] Ecuador             El Salvador         Guatemala          
## [13] Haiti               Honduras            Jamaica            
## [16] Mexico              Nicaragua           Panama             
## [19] Paraguay            Peru                Puerto Rico        
## [22] Trinidad and Tobago United States       Uruguay            
## [25] Venezuela          
## 25 Levels: Haiti Bolivia Guatemala Nicaragua El Salvador Honduras ... Canada
americas %>% 
  plot_ly(x = ~country, 
        y = ~lifeExp,
        type = "scatter",
        mode = "markers",
        hoverinfo = 'text',
        text = ~paste(country, '', lifeExp), # custom hover text
        yaxis = list(hoverformat = '.2f')) # trying to get number of hover decimals down but not working
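
The hoverformat attempt likely has no effect for two reasons: axis settings such as hoverformat belong in layout() rather than plot_ly(), and with hoverinfo = 'text' the tooltip shows only the custom string, so axis number formatting would not touch it anyway. Rounding inside the hover text is one way around this. A sketch on toy data (the values and the hover column name are made up for illustration):

```r
library(plotly)

# Toy data standing in for the americas data frame (illustrative values only)
toy <- data.frame(
  country = c("Canada", "Cuba"),
  lifeExp = c(80.653, 78.273)
)

# Build the hover label with the rounding already applied
toy$hover <- sprintf("%s: %.2f", toy$country, toy$lifeExp)

p_hover <- plot_ly(toy,
                   x = ~country,
                   y = ~lifeExp,
                   type = "scatter",
                   mode = "markers",
                   hoverinfo = "text",
                   text = ~hover)
p_hover
```

If the default tooltip (the y value itself) is wanted instead of custom text, layout(yaxis = list(hoverformat = ".2f")) is the place to set the number of decimals.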

To map, I’m going to adapt some ggplot map code that I wrote for the Gapminder dataset in a previous assignment, and make it interactive with ggplotly.

Data wrangling for mapping:

world <- map_data("world")

americas07 <-gapminder %>%
  filter(year == 2007) %>%
  filter(continent == "Americas") %>% 
  rename(region = country) %>% 
  mutate(region = as.character(region)) %>% 
  mutate(region = ifelse(region == "United States", "USA", region))

americasgeog <- right_join(world, americas07, by = "region") 
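
Because the join matches on region names, any gapminder country whose spelling differs from the map data ends up without polygons. A quick sketch for listing such mismatches before joining (it assumes the same gapminder and maps packages loaded above; the result depends on the installed package versions, so no output is shown):

```r
library(gapminder)
library(dplyr)
library(ggplot2)  # map_data(); requires the maps package to be installed

world <- map_data("world")

americas07 <- gapminder %>%
  filter(year == 2007, continent == "Americas") %>%
  mutate(region = as.character(country),
         region = ifelse(region == "United States", "USA", region))

# Country names with no matching region in the map data
unmatched <- setdiff(americas07$region, unique(world$region))
unmatched
```

Any names listed can then be recoded in the same way as "United States" before the join.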

Interactive map in ggplotly:

gg <- ggplot() + 
  geom_polygon(data = americasgeog,
               aes(x=long,
                   y = lat,
                   group = group,
                   fill = lifeExp,
                   text = paste0("<b>", region, "</b>\n", round(lifeExp, 1), " years"))) + # bold country name in hover label
  coord_map("mollweide") +
  scale_fill_distiller(name = "Life\nexpectancy", palette = 4, direction = 1) + 
  theme_void() +
  theme(panel.grid = element_blank()) + # remove grid lines
  ggtitle("Life expectancy in the Americas, 2007")
## Warning: Ignoring unknown aesthetics: text
ggplotly(gg, tooltip = "text") #hover labels life expectancy

Part 4: Writing figures to file

Use ggsave() to explicitly save a plot to file. Then use ![Alt text](/path/to/img.png) to load and embed it in your report. You can play around with various options, such as:

Arguments of ggsave(), such as width, height, resolution or text scaling. Various graphics devices, e.g. a vector vs. raster format. Explicit provision of the plot object p via ggsave(…, plot = p). Show a situation in which this actually matters.

To specify which plot will be saved, I’m going to add plot =. This is usually a good idea: if I don’t specify a plot, ggsave saves the most recent plot created or modified. If I later re-order my code chunks or add another plot, it would silently save the wrong one. In my opinion, it’s better to explicitly name the desired plot, even when it’s not strictly necessary.
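
A small sketch of a situation where this matters, using the built-in mtcars data and two placeholder plots (p1 and p2 are hypothetical names for this example, and the files go to a temporary folder):

```r
library(ggplot2)

p1 <- ggplot(mtcars, aes(wt, mpg)) + geom_point() + ggtitle("Plot I want to save")
p2 <- ggplot(mtcars, aes(hp, mpg)) + geom_point() + ggtitle("Plot created later")

out_dir <- tempdir()

# Without plot =, ggsave() uses last_plot(), which is now p2
ggsave("implicit.png", path = out_dir, width = 6, height = 4, dpi = 72)

# With plot =, the saved figure does not depend on chunk order
ggsave("explicit.png", plot = p1, path = out_dir, width = 6, height = 4, dpi = 72)
```

Opening the two files side by side should show that implicit.png contains the later plot even though p1 was the intended figure.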

If I don’t specify a file destination, the exported files will end up in the same folder as my assignment 5 R project. I’ve specified my homework 5 repo as the destination for the png image, which I’ve linked to below.

I don’t think SVGs can be exported directly to github, so you’ll have to take my word for it that the svg export completed successfully on my home computer. As SVGs are vector images, there’s no need to set a resolution (e.g. 300 dots per inch). Width and height still determine the size of the canvas and the relative scale of the text; since I left them out, ggsave falls back to the current device size (7 x 5 in, per the message below).

#png
# ggsave() writes to the local file system, so path must be a local folder
# (assumed here: a test_images/ folder inside the project), not a GitHub URL
ggsave("lifeExp6x4.png", plot = amer_lifeExp_plot, device = "png", path = "test_images", width = 6, height = 4, units = "in", dpi = 300)
#svg
ggsave("lifeExp6x4.svg", plot = amer_lifeExp_plot, device = "svg")
## Saving 7 x 5 in image
#pdf
# ggsave("lifeExp6x4.pdf", plot = amer_lifeExp_plot, device = "pdf", dpi = 300)

Check out the plot that I uploaded to my github repo.

The pdf export had some issues with recognizing font type, so I haven’t uploaded it.

Thanks to: